DOCUMENT RESUME

TM 008 775

Lai, Morris K.
Rules of Thumb from the Literature on Research and Evaluation.
Apr 79
10p.; Paper presented at the Annual Meeting of the
American Educational Research Association (63rd, San
Francisco, California, April 8-12, 1979)

MF01/PC01 Plus Postage.
*Evaluation Methods; *Research Criteria; *Research
Methodology; *Research Problems; *Testing

Practical advice on frequently asked questions dealing with research and
evaluation methodology is presented as rules of thumb, with citations to the
author's sources. A statement in the literature is considered a rule of thumb
if it meets one of the following criteria: (1) it is specifically called a rule
of thumb; (2) it contains numbers in place of algebraic symbols; or (3) it
contains a reference to previous successes using a particular level. The rules
included here deal with article title, budgeting for research staff, test
difficulty and discrimination, distribution, significance, Fisher test,
reliability of gain scores, interrater reliability, item construction, testing
time, response rate, observation, sample size, sampling, skewness, test
wiseness, test-retest reliability, and test revision. (MA)

Introduction
Almost every researcher or evaluator has had the frustrating experience of
seeking quickly needed practical advice, but ending up with either long
drawn-out discourses or complicated formulas that are of little practical use.
How many times have the following questions been asked, but no usable answers
have been given: How large a sample should I use? How many items should be put
on the test? How reliable should the test be? What constitutes an educationally
significant difference? Even if ballpark answers are requested to questions
like these, more often than not, practical answers are not forthcoming.
Sometimes when answers are suggested, it is evident that personal bias has in
some way influenced the response.

Although consultants or textbooks may validly respond to a practitioner's
questions by saying, "It depends," oftentimes a reasonable estimate or ballpark
figure would be more appropriate. In fact, a rule of thumb based on previous
empirical and theoretical results may in many cases be even more "correct"
than a rule that results from exact, complicated formulas which are based on
questionable assumptions.

The following question is perhaps one of the best ways of illustrating the
perspective being taken in this paper: If you were developing a test and
wanted to know how many subjects should be in a formal tryout for item analyses,
which type of response would you prefer? a) "It all depends, but here is a 600
page book on tests and measurements," or b) "Henrysson (in Thorndike's
Educational Measurement, 1971) recommends that at least 300 subjects be used."
Although some will insist that the first response is more defensible, others
would benefit much more from the second response.

Method and data source
The uncovering of rules of thumb is of course a never-ending task; however,
as a start, the author went through as many educational research and evaluation
books as he could find a) in his own collection, b) in the University of Hawai'i
library, c) in the collections of colleagues, and d) in the many research and
evaluation projects at the Curriculum Research and Development Group of the
University of Hawai'i. Concurrently a search of the most appropriate journals
(e.g., Psychological Bulletin, Review of Educational Research, American
Educational Research Journal) was carried out. For the past five years all
promising AERA meeting articles were sent for and read. Finally a retroactive
ERIC search was carried out. In order to make the task a reasonable one,
publications before 1970 were in general not included.
Given this vast amount of data, it was necessary to skim rapidly over all
parts of the material which did not constitute rules of thumb. In selecting
rules of thumb the following definition was used (admittedly with some
flexibility): a statement was considered a rule of thumb if any of the following
were true: a) it was identified specifically as a rule of thumb, b) it was a
suggestion that contained actual numbers in place of algebraic symbols (e.g.,
"p should be between .2 and .8"), or c) a reference was made to previous
successes using a particular level (e.g., "So-and-so found that at least 1000
subjects were needed for a national sample.").


Among the statements that were not classified as rules of thumb were
1) results based on a single study (and reported as such), 2) general
recommendations (e.g., "Involve the evaluator at the beginning of the
project."), 3) rules which were likely to be or become outdated (e.g., "Expect
to spend $___ in carrying out the following task."), and 4) rules whose content
was too exotic or unique to be of interest to many practitioners (e.g., Glass
et al., 1975, page 96, on partial autocorrelations: "Perhaps only the first two
or three autocorrelations can be adequately estimated (S. 36) with even
relatively long series (of 50 to 100).").
Those statements which passed the definition screening were then recorded
on 4x6 cards and classified by author as well as content area. Where conflicting
rules of thumb were found, all were included. In many cases authors presented
rules which were referenced to other authors. A decision was made to cite both
the reference in which the rule was found as well as the author to whom the
rule was attributed; however, for pragmatic reasons, only the reference in which
the rule was actually read will be listed at the end of the complete treatise
(e.g., Ebel, in Ahmann & Glock, 1971: The difficulty level of test items should
be between .40 and .70).


In doing the research it became apparent that in the best tradition of
oral transmission of culture many rules of thumb have been passed down through
the years, oftentimes without reference to the rule's originator(s). When
these cases have arisen no serious attempt has been made to track down the true
source. Instead an often arbitrary representative has been selected to receive
credit or blame, if not as the originator of the given rule, then as a
perpetuator.


In my attempts to develop and describe the methods used in compiling a
collection of rules of thumb I have been somewhat influenced by Jackson (1978),
who forcefully argued that methods used in reviewing research should be made
explicit. He also discussed the ways in which the methodology of such reviews
could be improved. In the current attempts to put together a compendium of
rules of thumb, it became quite apparent that the methodology of compiling was
perhaps as important as the rules themselves. As alluded to earlier, the method
used could affect the number and type of rules selected, the classification of
the rules, the author referencing, etc.

The number of rules found precludes a complete presentation in this paper.
Instead the following examples are given to help the reader decide whether or
not to obtain the comprehensive compendium that will be available in the near
future.

Article title. Length should be 12-15 words (APA, 1974, p. 14).

Budget. For RFP's (requests for proposal), person years are translated at times
into $25,000 to over $50,000 (Scriven & Roth, 1977, p. 17).

Difficulty level. The difficulty level for items on classroom tests should be
between .4 and .7 (Ebel, R. L., in Ahmann & Glock, 1971).

Discrimination index. A reasonably good test item should have an index of at
least .30 (Ahmann & Glock, 1971, p. 189).

Distribution. Group observed outcomes into 8 to [illegible] equal-width
intervals (Marascuilo, 1971, p. 179).
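
The difficulty-level and discrimination-index rules above can be checked mechanically once tryout data are in hand. What follows is only a minimal sketch, not a procedure taken from the cited sources. It assumes items are scored 0/1, takes difficulty as the proportion of examinees answering correctly, and takes the discrimination index as the difference in that proportion between the top and bottom 27% of examinees ranked by total score. The 27% split, the function names, and the data layout are all assumptions made for illustration.

```python
def item_stats(responses, group_fraction=0.27):
    """Return (difficulty, discrimination) for each item.

    responses: list of rows, one per examinee; each row is a list of
    0/1 item scores. group_fraction (assumed 27%) sets the size of the
    upper and lower scoring groups used for the discrimination index.
    """
    n = len(responses)
    n_items = len(responses[0])
    # Rank examinees by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(n * group_fraction))
    upper, lower = ranked[:k], ranked[-k:]
    stats = []
    for j in range(n_items):
        difficulty = sum(row[j] for row in responses) / n
        discrimination = (sum(row[j] for row in upper)
                          - sum(row[j] for row in lower)) / k
        stats.append((difficulty, discrimination))
    return stats


def flag_items(stats, p_range=(0.4, 0.7), min_d=0.30):
    """Apply the two rules of thumb: difficulty within .4 to .7 and
    discrimination index of at least .30. Returns one
    (difficulty_ok, discrimination_ok) pair per item."""
    low, high = p_range
    return [(low <= p <= high, d >= min_d) for p, d in stats]


# Hypothetical tryout data: 10 examinees, 2 items.
tryout = [[1, 1], [1, 1], [1, 1], [1, 0], [1, 1],
          [0, 1], [0, 1], [0, 1], [0, 1], [0, 0]]
stats = item_stats(tryout)
flags = flag_items(stats)
```

With so few examinees the figures mean little; the tryout size of at least 300 subjects quoted in the Introduction is the relevant scale for real item analyses.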
Educationally significant. A difference is educationally meaningful if it is
≥ 1/3 of a standard deviation (or sometimes > 1/4 s.d.) or rate of growth
produces a post percentile greater than the pre percentile by one standard
error (Tallmadge, 1977, p. 34).

Fisher Test. If N < 20, use the Fisher Test (instead of χ²) in all cases
(Siegel, 1956, p. 110).

Gain score reliability. For N ≥ 30, gain score means are probably quite
reliable (Martuza, 1977, p. 2).

Interrater reliability. a) Should be at least .70 (Borg & Gall, 1971, p. 235).
b) Observers should be in perfect agreement 50% of the time (Borich, 1974, p. 259).
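
The first part of the educational-significance rule above amounts to expressing a difference in standard deviation units and comparing it with 1/3 (or 1/4). The sketch below illustrates that arithmetic only. Which standard deviation to divide by is not stated in the rule as quoted; using the comparison group's sample standard deviation is an assumption made here, and the function names are hypothetical.

```python
import statistics


def standardized_difference(treatment, comparison):
    """Mean difference expressed in comparison-group standard deviations.

    Assumption: the comparison group's sample standard deviation is
    the yardstick; the rule as quoted does not say which s.d. to use.
    """
    difference = statistics.mean(treatment) - statistics.mean(comparison)
    return difference / statistics.stdev(comparison)


def is_educationally_significant(effect, threshold=1 / 3):
    """True if the standardized difference meets the 1/3 s.d. rule.

    Pass threshold=0.25 for the more lenient 1/4 s.d. variant.
    """
    return effect >= threshold


# Hypothetical scores: comparison s.d. is 10, mean difference is 5.
effect = standardized_difference([45, 55, 65], [40, 50, 60])
meaningful = is_educationally_significant(effect)
```

The percentile-growth half of the rule needs norm tables and is not covered by this sketch.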




Item construction. Construct [illegible] than [illegible] ([author illegible], 1976, p. 30).


Item, test time. Average high school student should be able to answer two
true-false items, one multiple choice item, or one short answer item per
minute of testing time (Gronlund, 1971, p. 240).
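
The testing-time rule above converts directly into an estimate of how long a planned test will take. The following is a small sketch of that arithmetic; the dictionary keys and the function name are hypothetical, and the rates are simply the ones quoted in the rule for an average high school student.

```python
# Items answered per minute, as quoted in the rule of thumb.
ITEMS_PER_MINUTE = {
    "true_false": 2.0,
    "multiple_choice": 1.0,
    "short_answer": 1.0,
}


def estimated_minutes(item_counts):
    """Estimate testing time in minutes from a dict of item counts
    keyed by item type (keys must appear in ITEMS_PER_MINUTE)."""
    return sum(count / ITEMS_PER_MINUTE[kind]
               for kind, count in item_counts.items())


# Hypothetical test blueprint: 20 true-false, 30 multiple choice,
# and 5 short answer items.
minutes = estimated_minutes(
    {"true_false": 20, "multiple_choice": 30, "short_answer": 5}
)
```

For this blueprint the estimate is 10 + 30 + 5 = 45 minutes, which can then be compared with the class period available.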


Nonresponse. A questionnaire nonresponse rate of less than 20% can be
reasonably ignored (Isaac & Michael, 1971, p. #3).

Observations. For practical purposes of estimation two repeated independent
observations on any person are typically sufficient (Novick & Jackson, 1974).

Response rate. For mail surveys, an 80% return is acceptable (Sudman, 1976,
p. 30).


Sample size, cohort study. Need an N of [illegible]00-1000 per cohort for a
3-[illegible] year study (Cooley & Lohnes, 1976, p. 137).


Sampling correction. Finite population correction can be ignored whenever the
sampling fraction does not exceed 5% (and for many purposes even if it is as
high as 10%) (Cochran, 1977, p. 25).
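
The sampling-correction rule above is easier to judge once the size of the correction is seen. The sketch below uses the standard textbook form of the finite population correction for simple random sampling without replacement, in which the standard error is multiplied by the square root of (1 - n/N). The function names are hypothetical, and the 5% default merely restates the rule.

```python
import math


def sampling_fraction(n, population_size):
    """Fraction of the population included in the sample (n / N)."""
    return n / population_size


def fpc_factor(n, population_size):
    """Multiplier applied to the standard error under simple random
    sampling without replacement: sqrt(1 - n/N)."""
    return math.sqrt(1.0 - sampling_fraction(n, population_size))


def can_ignore_fpc(n, population_size, limit=0.05):
    """Rule of thumb: ignore the correction when the sampling fraction
    does not exceed the limit (5%, or 10% for many purposes)."""
    return sampling_fraction(n, population_size) <= limit


# Hypothetical survey: 50 sampled from a population of 1000.
factor = fpc_factor(50, 1000)
```

At a 5% sampling fraction the factor is about 0.975, so the correction shrinks the standard error by only about 2.5%; at 10% the shrinkage is about 5%. That is the sense in which the correction can often be ignored.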


Skewness, computation of. Not worthwhile to compute the measure of skewness
when N < 100 (McNemar, 1969, p. [illegible]).

Test practice effect. Usually improves scores at the second testing by no more
than [illegible] 5 (Anderson et al., [year illegible]).

Test-retest reliability. Should be > .85 ([author illegible], p. 243).

Test revision. Given a fairly broad level of [illegible], 5 to 8 students can
give considerable revision information (Bloom et al., 1971).

Some precautions are in order for users of rules of thumb. In many
circumstances, it will be important to get more details to insure that an
appropriate rule of thumb is being applied. Some rules of thumb may represent
remnants from mythology or traditional ignorance (cf. Aristotle). Other rules
of thumb may have come from persons brash enough to generalize from a single
study. A given statement might also have hidden in it the peculiar values of an
individual (e.g., a relatively conservative writer may present more stringent
rules of thumb). The user of the rule must, therefore, take into account the
source or perpetuator of the rule.

Despite these precautions it appears that rules of thumb can be of substantial
benefit, especially to the practitioner. Lest I incur the wrath of the union
of high-cost consultants, I would also add that personal expert advice can
still be extremely valuable. In fact it wouldn't be a bad idea to approach
research/evaluation problems through the use of a combination of a consultant
and a rule of thumb. At least with this combination, the researcher will
never end up with the far-too-frequent occurrence of paying for advice, but
not knowing what to do next. Of course the practitioner should make sure that
the consultant doesn't end up charging a lot for merely stating a rule of
thumb that the user already knows about.

In keeping with the spirit of this paper, I will end with two final rules
of thumb: 1) 95% of paper presenters at conferences go over their allotted
time (or should it read: Not enough time is allotted for 95% of the paper
presentations), and 2) listeners or readers start to fall asleep after about
8 pages of a conference paper (Lai, 1979 AERA Conference). In order to retain
my status as a bona fide user of rules of thumb, I hereby end this presentation
and invite all of you to join the society of thumb rulers.